 Oliver County


Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language

C, Jesus Alvarez, Karajeanes, Daua D., Prado, Ashley Celeste, Ruttan, John, Yang, Ivory, O'Brien, Sean, Sharma, Vasu, Zhu, Kevin

arXiv.org Artificial Intelligence

The digital exclusion of endangered languages remains a critical challenge in NLP, limiting both linguistic research and revitalization efforts. This study introduces the first computational investigation of Comanche, an Uto-Aztecan language on the verge of extinction, demonstrating how minimal-cost, community-informed NLP interventions can support language preservation. We present a manually curated dataset of 412 phrases, a synthetic data generation pipeline, and an empirical evaluation of GPT-4o and GPT-4o-mini for language identification. Our experiments reveal that while LLMs struggle with Comanche in zero-shot settings, few-shot prompting significantly improves performance, achieving near-perfect accuracy with just five examples. Our findings highlight the potential of targeted NLP methodologies in low-resource contexts and emphasize that visibility is the first step toward inclusion. By establishing a foundation for Comanche in NLP, we advocate for computational approaches that prioritize accessibility, cultural sensitivity, and community engagement.
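The five-example few-shot setup described above can be sketched as a prompt builder. This is a minimal illustration, not the authors' pipeline: the instruction wording, the `build_few_shot_prompt` helper, and the bracketed Comanche placeholders are all assumptions, and no model call is made.

```python
def build_few_shot_prompt(examples, query):
    """Assemble a language-identification prompt from labeled example phrases."""
    parts = ["Identify the language of the final phrase. Answer with the language name only."]
    for phrase, language in examples:
        # Each labeled example shows the model the expected answer format.
        parts.append(f"Phrase: {phrase}\nLanguage: {language}")
    # The query is left unlabeled so the model completes the last line.
    parts.append(f"Phrase: {query}\nLanguage:")
    return "\n\n".join(parts)


# Placeholder entries only; real phrases would come from a curated dataset.
examples = [
    ("<Comanche phrase 1>", "Comanche"),
    ("<Comanche phrase 2>", "Comanche"),
    ("Good morning, how are you?", "English"),
    ("Buenos días, ¿cómo estás?", "Spanish"),
    ("<Comanche phrase 3>", "Comanche"),
]
prompt = build_few_shot_prompt(examples, "<phrase to classify>")
print(prompt)
```

The resulting string would be sent as the user message to whichever chat model is under evaluation; a zero-shot baseline corresponds to passing an empty `examples` list.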


FOCUS on Contamination: A Geospatial Deep Learning Framework with a Noise-Aware Loss for Surface Water PFAS Prediction

Khan, Jowaria, Friedman, Alexa, Evans, Sydney, Wang, Runzi, Beins, Kaley, Andrews, David, Bondi-Kelly, Elizabeth

arXiv.org Artificial Intelligence

Per- and polyfluoroalkyl substances (PFAS), chemicals found in products like non-stick cookware, are persistent environmental pollutants with severe health risks. Accurately mapping PFAS contamination is crucial for guiding targeted remediation efforts and protecting public and environmental health, yet detection across large regions remains challenging due to the cost of testing and the difficulty of simulating their spread. In this work, we introduce FOCUS, a geospatial deep learning framework with a label noise-aware loss function, to predict PFAS contamination in surface water over large regions. By integrating hydrological flow data, land cover information, and proximity to known PFAS sources, our approach leverages both spatial and environmental context to improve prediction accuracy. We evaluate the performance of our approach through extensive ablation studies and comparative analyses against baselines like sparse segmentation, as well as existing scientific methods, including Kriging and pollutant transport simulations. Results highlight our framework's potential for scalable PFAS monitoring.


Boosting of Classification Models with Human-in-the-Loop Computational Visual Knowledge Discovery

Williams, Alice, Kovalerchuk, Boris

arXiv.org Artificial Intelligence

High-risk artificial intelligence and machine learning classification tasks, such as healthcare diagnosis, require accurate and interpretable prediction models. However, classifier algorithms typically sacrifice individual-case accuracy for overall model accuracy, limiting analysis of class overlap areas regardless of task significance. The Adaptive Boosting meta-algorithm, which won the 2003 Gödel Prize, analytically assigns higher weights to misclassified cases to reclassify. However, it relies on weaker base classifiers that are iteratively strengthened, limiting improvements from base classifiers. Combining visual and computational approaches enables selecting stronger base classifiers before boosting. This paper proposes moving boosting methodology from focusing on only misclassified cases to all cases in the class overlap areas using Computational and Interactive Visual Learning (CIVL) with a Human-in-the-Loop. It builds classifiers in lossless visualizations integrating human domain expertise and visual insights. A Divide and Classify process splits cases into simple and complex, classifying these individually through computational analysis and data visualization with lossless visualization spaces of Parallel Coordinates or other General Line Coordinates. After finding pure and overlap class areas, simple cases in pure areas are classified, generating interpretable sub-models like decision rules in Propositional and First-order Logics. Only multidimensional cases in the overlap areas are losslessly visualized, simplifying end-user cognitive tasks to identify difficult case patterns, including engineering features to form new classifiable patterns. Demonstration shows a perfectly accurate and losslessly interpretable model of the Iris dataset, and simulated data shows generalized benefits to accuracy and interpretability of models, increasing end-user confidence in discovered models.
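For context, the standard AdaBoost step that assigns higher weights to misclassified cases can be sketched as follows. This shows the textbook binary reweighting rule only, not the CIVL method proposed in the paper; the function name and the toy labels are illustrative assumptions.

```python
import math


def adaboost_reweight(weights, labels, predictions):
    """One round of the textbook binary AdaBoost weight update."""
    total = sum(weights)
    # Weighted error rate of the current weak classifier.
    err = sum(w for w, y, p in zip(weights, labels, predictions) if y != p) / total
    if not 0.0 < err < 1.0:
        raise ValueError("weighted error must lie strictly between 0 and 1")
    # Vote strength of this weak classifier.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified cases are scaled up, correctly classified cases scaled down.
    scaled = [
        w * math.exp(alpha if y != p else -alpha)
        for w, y, p in zip(weights, labels, predictions)
    ]
    norm = sum(scaled)
    return alpha, [w / norm for w in scaled]


# Toy example: five equally weighted cases, only the last one misclassified.
labels = [1, 1, -1, -1, 1]
predictions = [1, 1, -1, -1, -1]
alpha, new_weights = adaboost_reweight([0.2] * 5, labels, predictions)
print(alpha, new_weights)
```

After the update, the single misclassified case carries half of the total weight, which is what forces the next weak classifier to attend to it.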


ReSpark: Leveraging Previous Data Reports as References to Generate New Reports with LLMs

Tian, Yuan, Zhang, Chuhan, Wang, Xiaotong, Pan, Sitong, Cui, Weiwei, Zhang, Haidong, Deng, Dazhen, Wu, Yingcai

arXiv.org Artificial Intelligence

Creating data reports is time-consuming, as it requires iterative exploration and understanding of data, followed by summarizing the insights. While large language models (LLMs) are powerful tools for data processing and text generation, they often struggle to produce complete data reports that fully meet user expectations. One significant challenge is effectively communicating the entire analysis logic to LLMs. Moreover, determining a comprehensive analysis logic can be mentally taxing for users. To address these challenges, we propose ReSpark, an LLM-based method that leverages existing data reports as references for creating new ones. Given a data table, ReSpark searches for similar-topic reports, parses them into interdependent segments corresponding to analytical objectives, and executes them with new data. It identifies inconsistencies and customizes the objectives, data transformations, and textual descriptions. ReSpark allows users to review real-time outputs, insert new objectives, and modify report content. Its effectiveness was evaluated through comparative and user studies.


Predicting Barge Presence and Quantity on Inland Waterways using Vessel Tracking Data: A Machine Learning Approach

Agorkua, Geoffery, Hernandez, Sarah, Falquez, Maria, Poddar, Subhadipto, Pang, Shihao

arXiv.org Artificial Intelligence

This study presents a machine learning approach to predict the number of barges transported by vessels on inland waterways using tracking data from the Automatic Identification System (AIS). While AIS tracks the location of tug and tow vessels, it does not monitor the presence or number of barges transported by those vessels. Understanding the number and types of barges conveyed along river segments, between ports, and at ports is crucial for estimating the quantities of freight transported on the nation's waterways. This insight is also valuable for waterway management and infrastructure operations, affecting areas such as targeted dredging operations and data-driven resource allocation. Labeled sample data was generated using observations from traffic cameras located along key river segments and matched to AIS data records. A sample of 164 vessels representing up to 42 barge convoys per vessel was used for model development. The methodology involved first predicting barge presence and then predicting barge quantity. Features derived from the AIS data included speed measures, vessel characteristics, turning measures, and interaction terms. For predicting barge presence, the AdaBoost model achieved an F1 score of 0.932. For predicting barge quantity, the Random Forest combined with an AdaBoost ensemble model achieved an F1 score of 0.886. Bayesian optimization was used for hyperparameter tuning. By advancing predictive modeling for inland waterways, this study offers valuable insights for transportation planners and organizations, which require detailed knowledge of traffic volumes, including the flow of commodities, their destinations, and the tonnage moving in and out of ports.
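The two-stage structure (presence first, then quantity) can be sketched as below. This is only an outline of the control flow: the feature names, thresholds, and the `toy_presence` and `toy_quantity` stand-ins are invented for illustration and are not the paper's trained AdaBoost or Random Forest models.

```python
def predict_barges(features, presence_model, quantity_model):
    """Stage 1 decides whether barges are present; stage 2 estimates how many."""
    if not presence_model(features):
        return 0  # No barges predicted, so the quantity stage is skipped.
    return quantity_model(features)


def toy_presence(features):
    # Hypothetical rule: slower vessels are assumed to be pushing barges.
    return features["mean_speed_knots"] < 7.0


def toy_quantity(features):
    # Hypothetical rule: longer vessels are assumed to push larger convoys.
    return 15 if features["vessel_length_m"] >= 40.0 else 6


loaded = {"mean_speed_knots": 5.2, "vessel_length_m": 45.0}
light = {"mean_speed_knots": 9.8, "vessel_length_m": 45.0}
print(predict_barges(loaded, toy_presence, toy_quantity))
print(predict_barges(light, toy_presence, toy_quantity))
```

In practice each stand-in would be replaced by a fitted classifier's prediction method, with the quantity model trained only on records where barges were observed.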


Creating a Cooperative AI Policymaking Platform through Open Source Collaboration

Lewington, Aiden, Vittalam, Alekhya, Singh, Anshumaan, Uppuluri, Anuja, Ashok, Arjun, Athmaram, Ashrith Mandayam, Milt, Austin, Smith, Benjamin, Weinberger, Charlie, Sarin, Chatanya, Bergmeir, Christoph, Chang, Cliff, Patel, Daivik, Li, Daniel, Bell, David, Cao, Defu, Shin, Donghwa, Kang, Edward, Zhang, Edwin, Li, Enhui, Chen, Felix, Smithline, Gabe, Chen, Haipeng, Gasztowtt, Henry, Shin, Hoon, Zhang, Jiayun, Gray, Joshua, Low, Khai Hern, Patel, Kishan, Cooke, Lauren Hannah, Burstein, Marco, Kalapatapu, Maya, Mittal, Mitali, Chen, Raymond, Zhao, Rosie, Majid, Sameen, Potlapalli, Samya, Wang, Shang, Patel, Shrenik, Li, Shuheng, Komaragiri, Siva, Lu, Song, Siangjaeo, Sorawit, Jung, Sunghoo, Zhang, Tianyu, Mao, Valery, Krishnakumar, Vikram, Zhu, Vincent, Kam, Wesley, Li, Xingzhe, Liu, Yumeng

arXiv.org Artificial Intelligence

Advances in artificial intelligence (AI) present significant risks and opportunities, requiring improved governance to mitigate societal harms and promote equitable benefits. Current incentive structures and regulatory delays may hinder responsible AI development and deployment, particularly in light of the transformative potential of large language models (LLMs). To address these challenges, we propose developing the following three contributions: (1) a large multimodal text and economic-timeseries foundation model that integrates economic and natural language policy data for enhanced forecasting and decision-making, (2) algorithmic mechanisms for eliciting diverse and representative perspectives, enabling the creation of data-driven public policy recommendations, and (3) an AI-driven web platform for supporting transparent, inclusive, and data-driven policymaking.


Can Language Models Reason about Individualistic Human Values and Preferences?

Jiang, Liwei, Sorensen, Taylor, Levine, Sydney, Choi, Yejin

arXiv.org Artificial Intelligence

Recent calls for pluralistic alignment emphasize that AI systems should address the diverse needs of all people. Yet, efforts in this space often require sorting people into fixed buckets of pre-specified diversity-defining dimensions (e.g., demographics, personalities, communication styles), risking smoothing out or even stereotyping the rich spectrum of individualistic variations. To achieve an authentic representation of diversity that respects individuality, we propose individualistic alignment. While individualistic alignment can take various forms, in this paper, we introduce IndieValueCatalog, a dataset transformed from the influential World Values Survey (WVS), to study language models (LMs) on the specific challenge of individualistic value reasoning. Specifically, given a sample of an individual's value-expressing statements, models are tasked with predicting their value judgments in novel cases. With IndieValueCatalog, we reveal critical limitations in frontier LMs' abilities to reason about individualistic human values, with accuracies ranging only between 55% and 65%. Moreover, our results highlight that a precise description of individualistic values cannot be approximated only via demographic information. We also identify a partiality of LMs in reasoning about global individualistic values, as measured by our proposed Value Inequity Index (σINEQUITY). Finally, we train a series of Individualistic Value Reasoners (IndieValueReasoner) using IndieValueCatalog to enhance models' individualistic value reasoning capability, revealing new patterns and dynamics in global human values. We outline future research challenges and opportunities for advancing individualistic alignment.


The Role of AI in Peer Support for Young People: A Study of Preferences for Human- and AI-Generated Responses

Young, Jordyn, Jawara, Laala M, Nguyen, Diep N, Daly, Brian, Huh-Yoo, Jina, Razi, Afsaneh

arXiv.org Artificial Intelligence

Generative Artificial Intelligence (AI) is integrated into everyday technology, including news, education, and social media. AI has further pervaded private conversations as conversational partners, auto-completion, and response suggestions. As social media becomes young people's main method of peer support exchange, we need to understand when and how AI can facilitate and assist in such exchanges in a beneficial, safe, and socially appropriate way. We asked 622 young people to complete an online survey and evaluate blinded human- and AI-generated responses to help-seeking messages. We found that participants preferred the AI-generated response to situations about relationships, self-expression, and physical health. However, when addressing a sensitive topic, like suicidal thoughts, young people preferred the human response. We also discuss the role of training in online peer support exchange and its implications for supporting young people's well-being. Disclaimer: This paper includes sensitive topics, including suicide ideation. Reader discretion is advised.


PharmacoNet: Accelerating Large-Scale Virtual Screening by Deep Pharmacophore Modeling

Seo, Seonghwan, Kim, Woo Youn

arXiv.org Artificial Intelligence

As the size of accessible compound libraries expands to over 10 billion, the need for more efficient structure-based virtual screening methods is emerging. Different pre-screening methods have been developed for rapid screening, but there is still a lack of structure-based methods applicable to various proteins that perform protein-ligand binding conformation prediction and scoring in an extremely short time. Here, we describe for the first time a deep-learning framework for structure-based pharmacophore modeling to address this challenge. We frame pharmacophore modeling as an instance segmentation problem to determine each protein hotspot and the location of corresponding pharmacophores, and protein-ligand binding pose prediction as a graph-matching problem. PharmacoNet is significantly faster than state-of-the-art structure-based approaches, yet reasonably accurate with a simple scoring function. Furthermore, we show the promising result that PharmacoNet effectively retains hit candidates even under high pre-screening filtration rates. Overall, our study uncovers the hitherto untapped potential of a pharmacophore modeling approach in deep learning-based drug discovery.


Parallel Coordinates for Discovery of Interpretable Machine Learning Models

Hayes, Dustin, Kovalerchuk, Boris

arXiv.org Artificial Intelligence

This work uses visual knowledge discovery in parallel coordinates to advance methods of interpretable machine learning. The graphic data representation in parallel coordinates made the concepts of hypercubes and hyperblocks (HBs) simple to understand for end users. It is suggested to use mixed and pure hyperblocks in the proposed data classifier algorithm Hyper. It is shown that Hyper models generalize decision trees. The algorithm is presented in several settings and options to discover interactively or automatically overlapping or non-overlapping hyperblocks. Additionally, the use of hyperblocks in conjunction with language descriptions of visual patterns is demonstrated. The benchmark data from the UCI ML repository were used to evaluate the Hyper algorithm. It enabled the discovery of mixed and pure HBs evaluated using 10-fold cross validation. Connections among hyperblocks, dimension reduction and visualization have been established. The capability of end users to find and observe hyperblocks, as well as the ability of side-by-side visualizations to make patterns evident, are among the major advantages of hyperblock technology and the Hyper algorithm. A new method to visualize incomplete n-D data with missing values is proposed, which traditional parallel coordinates do not support. The ability of HBs to prevent both overgeneralization and overfitting of data better than decision trees is demonstrated as another benefit of hyperblocks. The features of the VisCanvas 2.0 software tool that implements Hyper technology are presented.
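The distinction between pure and mixed hyperblocks can be illustrated with a short sketch. It assumes a hyperblock is an axis-aligned box given by one closed interval per attribute, which the abstract does not spell out; the `Hyperblock` class, helper names, and toy data are illustrative and are not taken from the Hyper algorithm or VisCanvas.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Hyperblock:
    """Axis-aligned n-D box: one closed interval [low, high] per attribute (assumed definition)."""
    lows: Sequence[float]
    highs: Sequence[float]

    def contains(self, point):
        # A point is inside when every coordinate falls within its interval.
        return all(lo <= x <= hi for lo, x, hi in zip(self.lows, point, self.highs))


def classes_inside(block, points, labels):
    """Collect the class labels of all cases that fall inside the block."""
    return {label for point, label in zip(points, labels) if block.contains(point)}


def is_pure(block, points, labels):
    """A block is pure when the cases it covers all belong to a single class."""
    return len(classes_inside(block, points, labels)) == 1


# Toy 2-D data with two classes.
points = [(5.0, 1.4), (4.8, 1.5), (6.5, 4.6)]
labels = ["a", "a", "b"]
narrow = Hyperblock(lows=(4.5, 1.0), highs=(5.5, 2.0))
wide = Hyperblock(lows=(4.0, 1.0), highs=(7.0, 5.0))
print(is_pure(narrow, points, labels), is_pure(wide, points, labels))
```

Each pure block reads directly as an interpretable rule of the form "if every attribute lies in its interval, then predict the block's class", which is the sense in which such models resemble decision-tree leaves.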